Abstract

We investigate the problem of learning XML queries, path queries and twig queries, from
examples given by the user. A learning algorithm takes as input a set of XML documents
with nodes annotated by the user and returns a query that selects the nodes in a manner
consistent with the annotation. We study two learning settings that differ in the types of
annotations. In the first setting the user may only indicate required nodes that the query 
must select (i.e., positive examples). In the second, more general, setting, the user may also 
indicate forbidden nodes that the query must not select (i.e., negative examples). The query 
may or may not select any node with no annotation. 

We formalize what it means for a class of queries to be learnable. One requirement is
the existence of a learning algorithm that is sound, i.e., always returns a query consistent
with the examples given by the user. Furthermore, the learning algorithm should be complete,
i.e., able to produce every query given sufficiently rich examples. Other requirements
involve tractability of the learning algorithm and its robustness to nonessential examples.
We identify practical classes of Boolean and unary, path and twig queries that are learnable
from positive examples. We also show that adding negative examples to the picture renders
learning infeasible.

1 Introduction 

XML has become a de facto standard for representation and exchange of data in web applications.
An XML document is basically a labeled tree whose leaves store textual data and the
standard XML format is text based to allow users an easy and direct access to the contents of the
document [32]. However, to satisfy even modest information needs, the user is often required to
formulate her queries using one of the existing query languages whose common core is XPath [45] [44].
XPath queries allow one to access the contents of the desired nodes with a syntax similar to directory
paths used to navigate in the UNIX file system. Unfortunately, even the XPath query language,
and any language with formal syntax, might be too difficult to be accessible to every user, and
in general, there is a lack of frameworks allowing the user to formulate the query without the
knowledge of a specialized query language.

In this paper, we propose to address this gap with the help of algorithms that infer the query 
from examples given by the user. We remark, however, that the need for general inference of 
XML queries is justified by other novel database applications. For instance, in the setting of 
XML data exchange [B] the pattern queries used to define data mappings need to be specified by 
the user. A learning algorithm could be a base for real ad-hoc data exchange solutions, where the 
pattern queries defining mappings are inferred as new sources are discovered. Another example 
of potential application is wrapper induction [20] [39].

The problem of XML query learning is defined as follows: given an XML document with nodes
annotated by the user, construct a query that selects the nodes according to the annotations.
Clearly, this problem has two parameters: the class of queries within which the algorithm should 
produce its result and the type of annotations the user may use. In the current work we focus 
on two well-known subclasses of XPath: twig and path queries pQ. We identify two types of 
annotations: required nodes i.e., nodes that need to be selected by the query, and forbidden 
nodes i.e., those that the query must not select. Because we do not require all nodes to be 
annotated, every unannotated node is implicitly annotated as neutral, which means that the 
query may or may not select it. In terms of computational learning theory [22], a required node
is called a positive example and a forbidden node is a negative example. In this paper, we consider 
two settings: one, where the user provides only positive examples, and a more general one, where 
both positive and negative examples are present. 

Example 1 Take for instance the XML document in Figure 1 with a library listing. Some of
its elements are annotated as required (+) and some as forbidden (−).

Figure 1: Annotation of a library database

The query that the user might want to receive is one that selects the titles of works by K.
Marx:

$q_0$ = /library/*[author="K. Marx"]/title.

The query /library/*[author="K. Marx"]/* is also consistent with the annotation but it properly
contains $q_0$. This makes $q_0$ more specific w.r.t. the user annotations, and therefore, it may be
better fitted for the results of learning. The query selecting titles of all works, /library/*/title,
is not consistent because it selects the forbidden title node. The query /library/book[author="K. Marx"]/title
is also not consistent with the annotation because it does not select the required title node of
Capital. □

Our study requires us to define precisely what it means for a class of queries $Q$ to be learnable.
We propose a definition influenced by computational learning theory [22], and inference of
languages in particular [21] [32] [13]. First of all, for $Q$ to be learnable there must exist a learning
algorithm learner which takes as input a sample $S$, i.e., a set of examples, and returns a
query $q \in Q$. Naturally, learner should be sound, that is, the query $q$ must be consistent with the
sample $S$. Because the soundness condition is not enough to filter out trivial learning algorithms
(cf. discussion following Definition 2), we furthermore require learner to be complete, that is, able
to learn every query given sufficiently informative examples. More precisely, learner is complete
if for every $q \in Q$ there exists a so-called characteristic sample $CS_q$ of $q$ (w.r.t. learner) such
that learner($CS_q$) returns $q$. Note that an unsavvy user in the role of a teacher may not know
exactly what the characteristic sample is, but may rather attempt to approach it by adding more and
more examples until the algorithm returns a satisfactory query. Consequently, it is commonly
required for the characteristic sample to be robust under inclusion, i.e., learner($S$) should return
$q$ for any sample $S$ that extends $CS_q$ while being consistent with $q$. Finally, polynomial restrictions
are imposed on learner and the size of the characteristic sample to ensure tractability of
the framework.

The primary goal of this paper is learning unary queries, but on the way there we also 
investigate the learnability of Boolean queries. Unary queries select a set of nodes in a document 
and are typically used for information extraction tasks. On the other hand, Boolean queries 
test whether or not a given document satisfies certain property, and their typical use case is 
the classification of documents e.g., for filtering purposes. When learning a Boolean query, an 
example is a tree with a marker indicating whether it is a positive or a negative example. 

Example 2 Consider a simple XML feed with offers from a consumer-to-consumer web site
(Figure 2) annotated by the user as either required (+) or forbidden (−). A Boolean query
satisfying the user annotations selects all sale offers, i.e., $q_1$ = .[offer//item/type="For sale"]. □

Figure 2: An annotated XML stream.

We investigate the learnability of Boolean and unary, path and twig queries in the presence of
positive examples only and in the presence of both positive and negative examples. For learning
in the presence of positive examples only, we identify practical subclasses of anchored path
queries and path-subsumption-free twig queries that are learnable. The main idea behind our
learning algorithms is to attempt to construct an (inclusion-)minimal query consistent with the
examples. Intuitively this means that our algorithms try to construct a query that is as specific
as possible with respect to the user input (cf. $q_0$ in Example 1). This approach is common to
a host of algorithms learning concepts from positive examples [3] including reversible regular
languages [3], k-testable regular languages [18], and single occurrence regular expressions [5].
While our learning algorithms for path queries return minimal queries consistent with the input
sample, we show that this approach cannot be fully adopted for twig queries because there are
input samples for which the consistent minimal twig query is of exponential size. Here, our
learning algorithms return queries that can be seen as polynomially-sized approximations.

The learnability of the full classes of path and twig queries remains an open question. However, 
we identify the essential properties of the query classes that enable our learning techniques, and 
observe that these properties do not hold for the full classes of path and twig queries. This 
indicates that new approaches may need to be explored if learning of the full classes is feasible 
at all. 

In the setting where both positive and negative examples are allowed, we study the consistency
problem: given a document with a set of positive and negative annotations, is there a
query that satisfies the annotations? This problem is trivial if only positive examples are given
because the universal query, which selects all nodes in a tree, is consistent with any set of positive
examples. However, as we show, adding even one negative example renders the consistency
problem intractable. This result holds for all considered classes of queries, including anchored
path queries and path-subsumption-free twig queries, and in fact, it holds for classes of queries
so simple that it is hard to envision reasonable restrictions that would admit learnability
in the presence of positive and negative examples.

The main contribution of this paper is defining and establishing theoretical boundaries for 
learning path and twig queries from examples. To the best of our knowledge this is the first work 
addressing this particular problem. Additionally, we investigate two problems that might be of 
independent interest: constructing a minimal query consistent with a set of positive examples and 
checking the consistency of a set of positive and negative examples. The characterization of the
properties of the learnable classes of queries and the algorithm for learning unary path queries are
based on existing techniques, tree pattern homomorphisms [UJ and pattern learning [21] [37],
but we employ them in new, nontrivial ways. The remaining results, including the remaining
learning algorithms and intractability of the consistency problem, are new and nontrivial.

The paper is organized as follows. In Section 2 we introduce basic notions and define formally
the learning framework. In Section 3 we define the learnable subclasses of queries and identify
their essential properties that enable our learning algorithms. In Sections 4 through 7 we present
the corresponding learning algorithms. In Section 8 we discuss the impact of negative examples
on learning. We discuss the related work in Section 9. Finally, we summarize our results and
outline further directions in Section 10. Because of space restrictions we present only sketches
of the most important proofs; complete proofs will be given in the full version of the paper
(currently in preparation for journal submission).

Acknowledgments. We would like to thank our colleagues and the anonymous reviewers
for their helpful comments. We also would like to thank Radu Ciucanu and Ioana Adam who
implemented the algorithms and shared their insights, which allowed us to improve theoretical properties
of the algorithms. This research has been partially supported by Ministry of Higher Education
and Research, Nord-Pas de Calais Regional Council and FEDER through the Contrat de Projets 
Etat Region (CPER) 2007-2013, Codex project ANR-08-DEFIS-004, and Polish Ministry of 
Science and Higher Education research project N N206 371339. 

2 Basic notions 

Throughout this paper we assume an infinite set of node labels $\Sigma$, which allows us to model
documents with textual values. We also assume that $\Sigma$ has a total order that can be tested in
constant time and has a minimal element that can be obtained in constant time as well. We
extend the order on $\Sigma$ to the standard lexicographical order $<_{lex}$ on words over $\Sigma$ and define a
well-founded canonical order on words: $w <_{can} u$ iff $|w| < |u|$, or $|w| = |u|$ and $w <_{lex} u$.
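The canonical order can be illustrated with a small sort key. This is only a sketch under assumptions not made in the text: labels are single characters, words are Python strings, and the built-in string comparison stands in for the total order on $\Sigma$; the name `canonical_key` is hypothetical.

```python
def canonical_key(word):
    # Compare by length first, then lexicographically among equal lengths.
    return (len(word), word)

words = ["b", "ab", "a", "ba", "aaa"]
ordered = sorted(words, key=canonical_key)
print(ordered)
```

Because every word has only finitely many shorter words (over the labels actually in use), sorting by this key never places a longer word before a shorter one, which is what makes the order convenient for enumeration.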
Trees We model XML documents with unranked labeled trees. Formally, a tree $t$ is a tuple
$(N_t, root_t, lab_t, child_t)$, where $N_t$ is a finite set of nodes, $root_t \in N_t$ is a distinguished root node,
$lab_t : N_t \to \Sigma$ is a labeling function, and $child_t \subseteq N_t \times N_t$ is the parent-child relation. We
assume that the relation $child_t$ is acyclic and require every non-root node to have exactly one
predecessor in this relation. By $Tree_0$ we denote the set of all trees.

The size of a tree is the cardinality of its node set. The depth of a node is the length of the 
path from the root to the node and the height of the tree is the depth of its deepest leaf. For a 
tree t by Paths(t) we denote the set of paths from the root node to the leaf nodes of t. We view 
a path both as a tree, in particular it has nodes, and as a word. Often, we use unranked terms
over $\Sigma$ to represent trees. For instance, the term r(a(b),b(a(c)),c(b(a))) corresponds to the tree
$t_0$ in Figure 3(a).

Figure 3: Trees.
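As an illustration of these notions, the following sketch encodes the term r(a(b),b(a(c)),c(b(a))) and computes its size, height and set of root-to-leaf paths. The nested-tuple encoding `(label, children)` and the function names are assumptions made for the example, not notation from the paper.

```python
# A tree is a pair (label, children); a leaf has an empty child list.
def size(t):
    return 1 + sum(size(kid) for kid in t[1])

def height(t):
    # Depth of the deepest leaf, counted in edges from the root.
    return 0 if not t[1] else 1 + max(height(kid) for kid in t[1])

def paths(t):
    # Root-to-leaf paths, each viewed as a word (tuple of labels).
    label, kids = t
    if not kids:
        return {(label,)}
    return {(label,) + p for kid in kids for p in paths(kid)}

# The term r(a(b), b(a(c)), c(b(a))) from the text.
t0 = ("r", [("a", [("b", [])]),
            ("b", [("a", [("c", [])])]),
            ("c", [("b", [("a", [])])])])

print(size(t0), height(t0), sorted(paths(t0)))
```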



To represent examples and answers to queries, we use trees with one distinguished selected
node. Formally, a decorated tree is a pair $(t, sel_t)$, where $t$ is a tree and $sel_t \in N_t$ is a distinguished
selected node. We denote the set of all decorated trees by $Tree_1$. Figure 3(b) contains two
decorated versions of $t_0$: the selected node is indicated with a square box. In the sequel, we
rarely make the distinction between standard trees and decorated ones, and when it does not
lead to ambiguity, we refer to both structures as simply trees.

Queries. We work with the class of twig queries, also known as tree pattern queries pQ. Twig
queries are essentially unordered trees whose nodes may be additionally labeled with a distin- 
guished wildcard symbol * and that use two types of edges, child and descendant, corresponding 
to the standard XPath axes. To model unary queries we also add a distinguished selecting node. 



Figure 4: Twig queries. (a) Boolean twig query $q_0$. (b) Unary path query $p_0$.

A Boolean twig query $q$ is a tuple $(N_q, root_q, lab_q, child_q, desc_q)$, where $N_q$ is a finite set of
nodes, $root_q \in N_q$ is the root node, $lab_q : N_q \to \Sigma \cup \{*\}$ is a labeling function, $child_q \subseteq N_q \times N_q$
is a set of child edges, and $desc_q \subseteq N_q \times N_q$ is a set of descendant edges. We assume that
$child_q \cap desc_q = \emptyset$ and that the relation $child_q \cup desc_q$ is acyclic and require every non-root node
to have exactly one predecessor in this relation. By $Twig_0$ we denote the set of all Boolean twig
queries. A unary twig query is a pair $(q, sel_q)$, where $q$ is a Boolean twig query and $sel_q \in N_q$
is a distinguished selecting node. We denote the set of all unary twig queries by $Twig_1$. Figure 4
contains examples of twig queries: child edges are drawn with a single line, descendant edges
with a double line, and the selecting node is indicated with a square box.

Additionally, we use restricted classes of Boolean and unary path queries, $Path_0$ and $Path_1$
respectively. Formally, $Path_i$ contains those elements of $Twig_i$ whose nodes have at most one
child. Furthermore, the selecting node of a unary path query is always its only leaf (cf. Figure 4(b)).
We note that $Twig_1$ captures exactly the class of descending positive disjunction-free
XPath queries, and in the sequel, we use elements of the abbreviated XPath syntax [321 S3] to
present both elements of $Twig_1$ and $Twig_0$. For instance, the query in Figure 4(a) can be written
as r/*[*]//a, and the query in Figure 4(b) as r/*//a.

Because no unary twig query can select at the same time the root node and another node of 
a tree, we disallow the root to be an answer, and from now on, we consider only unary queries 
and decorated trees whose selected node is other than root. Note that this restriction can be 
easily bypassed by adding a virtual root node to every tree in the input sample. Also, this way 
the universal query is *//*. 



Embeddings We define the semantics of twig queries using the notion of embedding which is
essentially a mapping of nodes of a query to the nodes of a tree (or another query) that respects
the semantics of the edges of the query. In the sequel, for two $x, y \in \Sigma \cup \{*\}$ we say that $x$
matches $y$ if $y \neq *$ implies $x = y$. Note that this relation is not symmetric: a matches * but *
does not match a.

Formally, for $i \in \{0, 1\}$, a query $q \in Twig_i$ and a tree $t \in Tree_i$, an embedding of $q$ in $t$ is a
function $\lambda : N_q \to N_t$ such that:

1. $\lambda(root_q) = root_t$,

2. for every $(n, n') \in child_q$, $(\lambda(n), \lambda(n')) \in child_t$,

3. for every $(n, n') \in desc_q$, $(\lambda(n), \lambda(n')) \in (child_t)^+$,

4. for every $n \in N_q$, $lab_t(\lambda(n))$ matches $lab_q(n)$,

5. if $i = 1$, then $\lambda(sel_q) = sel_t$.

Then, we write $\lambda : q \hookrightarrow t$ or simply $t \preceq q$. Figure 5 presents all embeddings of the query $q_0$ in $t_0$.

Figure 5: Embeddings of $q_0$ in $t_0$.

Note that we do not require the embedding to be injective i.e., two nodes of the query may 
be mapped to the same node of the tree. Embeddings of path queries are, however, always 
injective. Also, note that the semantics of //-edge is that of a proper descendant (and not that 
of descendant-or-self). 
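The embedding conditions for Boolean twig queries can be checked by a direct recursion, sketched below. The encoding is an assumption for illustration: a tree is `(label, children)`, a query node is `(label, [(axis, child), ...])` with axis `"/"` or `"//"`, and the function names are hypothetical. Because embeddings need not be injective, each query child can be matched independently of its siblings. The recursion is naive and not tuned for efficiency.

```python
def matches(x, y):
    # x matches y if y is the wildcard or both labels are equal.
    return y == "*" or x == y

def descendants(t):
    # Proper descendants of the root of t.
    for kid in t[1]:
        yield kid
        yield from descendants(kid)

def embeds_at(q, t):
    # True if q can be embedded in t with the root of q mapped to the root of t.
    qlabel, qedges = q
    if not matches(t[0], qlabel):
        return False
    for axis, qchild in qedges:
        candidates = t[1] if axis == "/" else descendants(t)
        if not any(embeds_at(qchild, c) for c in candidates):
            return False
    return True

t0 = ("r", [("a", [("b", [])]),
            ("b", [("a", [("c", [])])]),
            ("c", [("b", [("a", [])])])])

# r/*[*]//a : a child of r that has some child and some proper descendant a.
q0 = ("r", [("/", ("*", [("/", ("*", [])), ("//", ("a", []))]))])
print(embeds_at(q0, t0))
```

In this encoding `q0` embeds in `t0`, for instance by mapping its first wildcard to the b-child of the root.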

Typically, the semantics of a unary query is defined in terms of the set of nodes it selects
in a tree [25] [26]: a node $n$ of a tree $t$ is an answer to a unary twig query $q$ in $t$ if there is
an embedding $\lambda : q \hookrightarrow t$ such that $\lambda(sel_q) = n$ (then $n$ is also said to be reachable by $q$ in $t$).
However, we use an alternative way of defining the semantics of a query. Formally, the language
of a query $q \in Twig_i$ for $i \in \{0, 1\}$ is the set

$\mathcal{L}_i(q) = \{t \in Tree_i \mid t \preceq q\}$.

Naturally, the two notions are very closely related, e.g., the decorated trees $t_1$ and $t_2$ (Figure 3(b))
belong to $\mathcal{L}_1(p_0)$ (Figure 4(b)) and the nodes selected in $t_1$ and $t_2$ are exactly the answers to $p_0$ in
tree $t_0$.
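For unary path queries the answer set can be computed step by step along the path. The sketch below assumes a hypothetical encoding in which a path query is `(root_label, [(axis, label), ...])`, a tree is `(label, children)`, and nodes are identified by their position, i.e., the tuple of child indices leading to them from the root.

```python
def matches(x, y):
    return y == "*" or x == y

def node_at(tree, pos):
    for i in pos:
        tree = tree[1][i]
    return tree

def child_positions(tree, pos):
    return [pos + (i,) for i in range(len(node_at(tree, pos)[1]))]

def descendant_positions(tree, pos):
    result = []
    for c in child_positions(tree, pos):
        result.append(c)
        result.extend(descendant_positions(tree, c))
    return result

def answers(path_query, tree):
    # Positions of the nodes selected by a unary path query.
    root_label, steps = path_query
    if not matches(tree[0], root_label):
        return set()
    frontier = {()}  # the root position
    for axis, label in steps:
        step = child_positions if axis == "/" else descendant_positions
        frontier = {p for pos in frontier for p in step(tree, pos)
                    if matches(node_at(tree, p)[0], label)}
    return frontier

t0 = ("r", [("a", [("b", [])]),
            ("b", [("a", [("c", [])])]),
            ("c", [("b", [("a", [])])])])
p0 = ("r", [("/", "*"), ("//", "a")])  # r/*//a
print(sorted(answers(p0, t0)))
```

For `p0` and `t0` this yields two positions, the a-node below b and the a-node below c/b, in line with the two decorated trees mentioned above.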

The notion of an embedding extends in a natural fashion to a pair of queries $q, p \in Twig_i$ for
some $i \in \{0, 1\}$: an embedding of $q$ in $p$ is a function $\lambda : N_q \to N_p$ that satisfies the conditions 1,
2, 4, 5 above (with $t$ being replaced by $p$) and the following condition:

3'. for all $(n, n') \in desc_q$, $(\lambda(n), \lambda(n')) \in (child_p \cup desc_p)^+$.

When $\lambda$ is an embedding of $p$ in $q$, we write $\lambda : p \hookrightarrow q$ or simply $q \preceq p$, and say that $p$ subsumes $q$.

The containment (or inclusion) $q \subseteq p$ of two queries $q, p \in Twig_i$ for $i \in \{0, 1\}$ is simply
$\mathcal{L}_i(q) \subseteq \mathcal{L}_i(p)$, and we say that $q$ and $p$ are equivalent, denoted $q \equiv p$, if $q \subseteq p$ and $p \subseteq q$. Note
that for twigs, subsumption implies containment, i.e., if $q \preceq p$, then $q \subseteq p$. The converse does not
hold in general. For instance, we have a[.//b] $\subseteq$ *[*] but a[.//b] $\not\preceq$ *[*]. There are also significant
computational differences: the containment of twigs is coNP-complete [36] [30] whereas their
subsumption is in PTIME.
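The gap between containment and subsumption in the example a[.//b] versus *[*] can be reproduced with a small embedding test between Boolean twig queries. This is a sketch under an assumed encoding, a query node being `(label, [(axis, child), ...])`; the function names are hypothetical and the naive recursion is meant for illustration, not as the polynomial procedure alluded to in the text.

```python
def q_descendants(q):
    # Proper descendants of the root of q, reached through edges of any kind.
    for _, kid in q[1]:
        yield kid
        yield from q_descendants(kid)

def embeds_query(p, q):
    # True if p embeds in q (root to root): child edges of p go to child
    # edges of q, descendant edges of p go to nonempty paths in q.
    plabel, pedges = p
    if not (plabel == "*" or plabel == q[0]):
        return False
    for axis, pchild in pedges:
        if axis == "/":
            candidates = [kid for ax, kid in q[1] if ax == "/"]
        else:
            candidates = q_descendants(q)
        if not any(embeds_query(pchild, c) for c in candidates):
            return False
    return True

star_with_child = ("*", [("/", ("*", []))])   # *[*]
a_desc_b = ("a", [("//", ("b", []))])         # a[.//b]
print(embeds_query(star_with_child, a_desc_b))
```

The child edge of *[*] has no child edge of a[.//b] to map to, so no embedding exists even though every tree satisfying a[.//b] also satisfies *[*].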

Query minimality In this paper we identify queries that are minimal for a given set of trees
(as examples). It is important to emphasise that we always mean minimality in terms of query
inclusion. Formally, for $i \in \{0, 1\}$, a class of queries $Q \subseteq Twig_i$, a query $q \in Q$, and a set of trees
$S \subseteq Tree_i$, we say that $q$ is a minimal query in $Q$ consistent with $S$ if $S \subseteq \mathcal{L}_i(q)$ and there is no
$q' \in Q$ such that $q' \subseteq q$, $q' \not\equiv q$, and $S \subseteq \mathcal{L}_i(q')$.

Learning framework We use a variant of the standard language inference framework [22] [21]
[32] [15] adapted to learning queries. A learning setting comprises the set of concepts that are
to be learnt, in our case queries, and the set of instances of the concepts that are to serve as
examples in learning, in our case trees (possibly decorated). These two sets are bound together
by the semantics which maps every concept to its set of instances.

Definition 1 A learning setting is a tuple $(\mathcal{D}, Q, \mathcal{L})$, where $\mathcal{D}$ is a set of examples, $Q$ is a class
of queries, and $\mathcal{L}$ is a function that maps every query in $Q$ to the set of all its examples (a subset
of $\mathcal{D}$). □

As an example, a setting for learning unary twig queries from positive examples is the tuple
$(Tree_1, Twig_1, \mathcal{L}_1)$. This general formulation also allows us to easily define settings for learning from
both positive and negative examples, which we present in Section 8.

To define formally what learnability for queries means we fix a learning setting $\mathcal{K} = (\mathcal{D}, Q, \mathcal{L})$
and introduce some auxiliary notions. A sample is a finite nonempty subset $S$ of $\mathcal{D}$, i.e., a set of
examples. The size of a sample is the sum of the sizes of the examples it contains. A sample $S$
is consistent with a query $q \in Q$ if $S \subseteq \mathcal{L}(q)$. A learning algorithm is an algorithm that takes a
sample and returns a query in $Q$ or a special value Null.

Definition 2 A query class $Q$ is learnable in polynomial time and data in the setting $\mathcal{K} = (\mathcal{D}, Q, \mathcal{L})$
iff there exists a polynomial learning algorithm learner and a polynomial poly such that
the following two conditions are satisfied:

1. Soundness. For any sample $S$ the algorithm learner($S$) returns a query consistent with $S$
or a special Null value if no such query exists.

2. Completeness. For any query $q \in Q$ there exists a sample $CS_q$ such that for every sample
$S$ that extends $CS_q$ consistently with $q$, i.e., $CS_q \subseteq S \subseteq \mathcal{L}(q)$, the algorithm learner($S$)
returns a query equivalent to $q$. Furthermore, the size of $CS_q$ is bounded by $poly(|q|)$. □

The sample $CS_q$ is often called the characteristic sample for $q$ w.r.t. learner and $\mathcal{K}$, but we point
out that for a learning algorithm there may exist many samples fitting the role and the definition
of learnability requires merely that one such sample exists. The soundness condition is a natural
requirement but alone it is insufficient to eliminate trivial learning algorithms. For instance, for
the setting where only positive examples are used, an algorithm returning the universal query
*//* is sound. Consequently, we require the algorithm to be complete analogously to how it is
done for grammatical language inference [HJ [331 [T3] . An alternative and natural way to ban 
trivial learning algorithms would be to require the algorithm to return some minimal query 
consistent with the input sample. Our approach follows this direction but as we show later on, it 
is not possible to fully adhere to it because there exist samples for which the minimal consistent 
twig query is of exponential size. 

3 Learnable query classes 

In this section we define the classes of queries that in the following sections we prove learnable
from positive examples, and identify two essential properties of these classes that enable our
learning algorithms. Both properties follow from the importance of logical implication in learning:
learning can often be seen as a search for the correct hypothesis obtained by an iterative
refinement of some initial hypothesis, and at every iteration the current hypothesis is often a logical
consequence of the previous one. The first property requires the containment to be equivalent
to subsumption, which allows us to capture containment with a simple structural characterization.
The second property is the existence of polynomially sized match sets [26], which were originally
introduced as an easy way of testing query inclusion. The match sets that we construct will serve
us as the characteristic samples. We emphasise that the full classes of twig and path queries
do not have these properties. This does not imply that they are not learnable; it merely
precludes the direct adaptation of our learning techniques. Whether the full classes of queries
are learnable remains an open question.

To formally define the two properties, we fix a class of queries $Q$ with their semantics defined
by $\mathcal{L}$. The properties are:

(P1) for every two $q_1, q_2 \in Q$, $q_1 \subseteq q_2$ if and only if $q_1 \preceq q_2$.

(P2) every $q \in Q$ has a polynomial match set, i.e., a set $CS_q$ of (positive) examples such that
the size of $CS_q$ is polynomial in the size of $q$ and for every $q' \in Q$ we have $q \subseteq q'$ if and
only if $CS_q \subseteq \mathcal{L}(q')$.

We next present the construction of match sets in a generic form and then we introduce the
learnable classes of queries and state the properties P1 and P2 for them.

3.1 Match sets as characteristic samples 

We now present the construction of match sets that will be later on used as characteristic samples.
Because the constructions of the match sets for all the subclasses of queries are very similar, we
present it in a generic form. Take a twig query $q$, let $N$ be the size of $q$, $a_0$ be the minimal element
of $\Sigma$, and $a_1$ and $a_2$ be two fresh symbols not used in $q$ and different from $a_0$. The constructed
match set $CS_q$ contains exactly two trees: $t_0$ is obtained from $q$ by replacing every $*$ with $a_0$ and
every descendant edge by a child edge; $t_1$ is obtained from $q$ by replacing every $*$ with $a_1$ and
every descendant edge with a path of length $N$ all of whose nodes are labeled with $a_2$. Figure 6
contains the characteristic sample for the unary twig query $q_1$ = r/b[a//b]//c[d]/*/c.

Figure 6: The characteristic sample for $q_1$.

We point out that for a query and a learning algorithm there might be more than just one characteristic
sample. This is also the case with our learning algorithms. While the construction we present
above might seem quite artificial, we use it due to its properties that might be of independent
interest (match sets). Simpler characteristic samples, easier for an unskilled user to compose,
are often possible.
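The two-tree construction can be sketched as follows for Boolean twig queries (the selecting node of a unary query is ignored here). Several choices are assumptions of this sketch rather than statements from the text: queries are encoded as `(label, [(axis, child), ...])`, trees as `(label, children)`, the strings `"a0"`, `"a1"`, `"a2"` stand in for the minimal and fresh labels and are presumed not to occur in the query, and a "path of length N" is read as N intermediate nodes inserted between the two endpoints of a descendant edge.

```python
def q_size(q):
    return 1 + sum(q_size(kid) for _, kid in q[1])

def t_size(t):
    return 1 + sum(t_size(kid) for kid in t[1])

def build(q, star_label, pad_label, pad_len):
    # Replace wildcards by star_label and stretch every descendant edge
    # into a chain of pad_len nodes labeled pad_label.
    label = star_label if q[0] == "*" else q[0]
    kids = []
    for axis, child in q[1]:
        sub = build(child, star_label, pad_label, pad_len)
        if axis == "//":
            for _ in range(pad_len):
                sub = (pad_label, [sub])
        kids.append(sub)
    return (label, kids)

def match_set(q, a0="a0", a1="a1", a2="a2"):
    n = q_size(q)
    t0 = build(q, a0, a2, 0)   # descendant edges become child edges
    t1 = build(q, a1, a2, n)   # descendant edges become long a2-chains
    return t0, t1

# r/b[a//b]//c[d]/*/c, without its selecting node
q1 = ("r", [("/", ("b", [
        ("/", ("a", [("//", ("b", []))])),
        ("//", ("c", [("/", ("d", [])),
                      ("/", ("*", [("/", ("c", []))]))])),
     ]))])

t0, t1 = match_set(q1)
print(t_size(t0), t_size(t1))
```

Under these assumptions the first tree has as many nodes as the query, and the second grows by N nodes per descendant edge, so both stay polynomial in the size of the query.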

3.2 Anchored path queries 

We begin with a base subclass of path queries, called anchored path queries. Essentially, a path
query is anchored when no inner * node is incident to a //-edge. The main reason for introducing
this class of queries is that when working with their embeddings the restriction on the use of //
allows us to limit the "jumps" that the embedding may perform in between two nodes connected
by a descendant edge. An additional restriction on the leaf node of Boolean path queries is
imposed for technical reasons (cf. proof of Lemma 3.1 for more details).

Formally, the class of unary anchored path queries imposes one restriction: a //-edge cannot
be incident to a *-node unless it is the root node or the leaf node (which is also selecting). For
instance, the unary queries r//a//b/*/c, *//a//b/*, and *//* are anchored but the query r//a/*//b
is not. An additional restriction is imposed on the Boolean anchored path queries: if the leaf
node is *, then the edge incident to it is //. For instance, the Boolean queries a//b/*/c//* and
a//b/*/c//a/*/b are anchored but the Boolean query *//a//b/* is not anchored. We denote by
$AnchPath_1$ and $AnchPath_0$ the sets of unary and Boolean anchored path queries respectively.

Clearly, the subclasses of anchored path queries are properly included in the full classes of path
queries. However, we believe that the restrictions are not very limiting and the classes of anchored
queries remain practical. Basically, anchored path queries cannot discriminate the descendants
of a node based on their depth alone. We also point out that the additional restriction imposed
on Boolean queries is quite minor: the Boolean query r//a/* is not anchored but it is equivalent
to r//a//* which is anchored. Note, however, that the Boolean query r//a/*/* does not have an
equivalent Boolean anchored query.
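The syntactic restrictions on anchored path queries can be tested directly on the abbreviated XPath string. The sketch below assumes queries written without filters as labels separated by `/` and `//`; the helper names `parse_path` and `is_anchored` are hypothetical.

```python
import re

def parse_path(s):
    # Split "r//a/*" into labels ["r", "a", "*"] and axes ["//", "/"].
    labels = re.split(r"//|/", s)
    axes = re.findall(r"//|/", s)
    return labels, axes

def is_anchored(s, boolean=False):
    labels, axes = parse_path(s)
    last = len(labels) - 1
    for i, label in enumerate(labels):
        if label != "*" or i == 0 or i == last:
            continue
        # Inner wildcard: neither adjacent edge may be a descendant edge.
        if axes[i - 1] == "//" or axes[i] == "//":
            return False
    # Boolean queries: a wildcard leaf must hang on a descendant edge.
    if boolean and last > 0 and labels[last] == "*" and axes[last - 1] != "//":
        return False
    return True

for query in ["r//a//b/*/c", "*//a//b/*", "*//*", "r//a/*//b"]:
    print(query, is_anchored(query))
```

Passing `boolean=True` applies the extra leaf restriction, so `is_anchored("r//a/*", boolean=True)` is rejected while `is_anchored("r//a//*", boolean=True)` is accepted, mirroring the examples in the text.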

While P1 for anchored path queries follows from the results in [27] [26], below we present a
proof using a technique that allows us to show P1 and P2 for all the query classes we introduce
later on (and these results are new and cannot be derived from the results in [27] [26]).

Lemma 3.1 Unary and Boolean anchored path queries have the properties P1 and P2.

To prove this lemma it suffices to show the following claim.

Claim 1 For any $i \in \{0, 1\}$ and any two $q, q' \in AnchPath_i$, if $CS_q \subseteq \mathcal{L}_i(q')$, then there exists
an embedding $\lambda : q' \hookrightarrow q$.

Proof We first give an equivalent yet more structured definition of anchored path queries.
A block is a path query fragment $B$ of the form $\sigma_0/\ldots/\sigma_n$, where $n \geq 0$, $\sigma_0, \sigma_n \in \Sigma$, and
$\sigma_1, \ldots, \sigma_{n-1} \in \Sigma \cup \{*\}$. An anchored path query $q$ is a path query of the form $B_0//B_1//\ldots//B_k$,
where $k \geq 0$, $B_i$ is a block for $1 \leq i \leq k-1$, and $B_0$ is either a block that can start with $*$ or a
single occurrence of $*$. Also, in case of Boolean anchored path queries $B_k$ is either a block or a
single occurrence of $*$, and in case of unary anchored path queries $B_k$ is either a block that can
end with $*$ or a single occurrence of $*$.

We first prove the claim for unary queries (i.e., $i = 1$). Let $N = |q|$ and $CS_q = \{t_0, t_1\}$ be
constructed as described in Section 3.1. For every node $n$ of $t_1$ whose label is not $a_2$, by $origin(n)$
we denote the node of $q$ corresponding to $n$. Also, fix $\lambda_1 : q' \hookrightarrow t_1$.

We make several observations. First, $|q'| \leq N$, or otherwise there would be no embedding of
$q'$ into $t_0$. For the same reason, $q'$ does not use the labels $a_1$ and $a_2$. Therefore, if a node $n$ of
$q'$ is mapped by $\lambda_1$ to a node with label $a_1$ or $a_2$, then $lab_{q'}(n) = *$.

Next, we show that $\lambda_1$ maps nodes of $q'$ only to those nodes of $t_1$ that are not labeled with
$a_2$. This is clearly the case for the root node and the selecting node of $q'$, which are mapped to the
root node and the selecting node of $t_1$, and from the construction of $t_1$, they have labels different
from $a_2$. In the following we show that this is the case with the other nodes.

Let $q'$ be of the form $B_0//B_1//\ldots//B_k$. Note that if a node $n$ is on the border of $B_j$ (for
$0 \leq j \leq k$) then from the definition of a block $n$ cannot be mapped to $a_2$. This is because $n$ is
either a root node, or a selecting node, or its label is not $*$.

Suppose that some node of $q'$ belonging to $B_i$, for $0 \leq i \leq k$, is mapped to a node with label
$a_2$ and let $n_1$ and $n_2$ be the nodes that are on the borders of $B_i$. Because $|B_i| \leq |q'| \leq N$ and
in $t_1$ nodes labeled with $a_2$ come in sequences of length $N$, one of the nodes $n_1$ and $n_2$ needs to
be mapped to a node labeled with $a_2$. This implies that one of $n_1$ and $n_2$ is labeled with $*$; a
contradiction.

This shows that $\lambda = \lambda_1 \circ origin$ (i.e., $\lambda(n) = origin(\lambda_1(n))$) is a properly defined function mapping $N_{q'}$ to $N_q$. We
now show that $\lambda$ is an embedding of $q'$ into $q$. The condition 2 holds because $\lambda_1$ preserves
the child relation and if nodes $(n_1, n_2)$ are in $child_{t_1}$ and both are not labelled with $a_2$ then
$(origin(n_1), origin(n_2)) \in child_q$. The conditions 1, 3, and 5 follow from the definition of $\lambda$. For
the condition 4, take any $n \in N_{q'}$ such that $a = lab_{q'}(n) \neq *$ and note that then, $\lambda_1(n)$ in $t_1$ has
the same label $a$ which is different from $a_1$ (because $q'$ does not use $a_1$) and $a_2$ (as shown above).
Therefore the node of $q$ that corresponds to $\lambda_1(n)$ is labeled with $a$ as well.

The proof for Boolean anchored path queries is analogous and it suffices to consider the case
when $B_k$ is a single occurrence of $*$. Then indeed an embedding $\lambda_1 : q' \hookrightarrow t_1$ may map the *-leaf
to a node labeled with $a_2$. We note, however, that $\lambda_1$ can be easily altered to map the *-leaf to
a non-$a_2$ node because the *-leaf is connected with a descendant edge and every $a_2$ node in $t_1$
has a descendant that is not labeled with $a_2$. □

To show $P_1$ it is enough to show the implication from left to right. Assume $q \subseteq q'$ and note that
$CS_q \subseteq \mathcal{L}(q)$. Therefore, $CS_q \subseteq \mathcal{L}(q')$, which by Claim 1 gives us $q \preceq q'$. $P_2$ follows directly
from Claim 1.
3.3 Conjunctions of anchored path queries

In our approach to learn twig queries we use path learning algorithms to infer a set of path 
queries satisfied in the input sample and then we combine these path queries into a twig query. 
Therefore, the midpoint between learning path queries and twig queries is learning conjunctions 
of path queries. We apply this technique only to learn Boolean twig queries and so we focus only 
on learning Boolean conjunctions of path queries. For convenience we use sets of Boolean path 
queries to represent conjunctions but a conjunction can also be seen as a Boolean twig query 
consisting of path queries meeting at the root node. The second representation is used to define 
the semantics of conjunctions and their characteristic samples. 

Because our path learning algorithms infer anchored queries, we consider only conjunctions of
Boolean anchored path queries. Also, if we have inferred two Boolean path queries $p_1$ and $p_2$, and
$p_1$ subsumes $p_2$, then from the point of view of learning there is no reason to keep $p_2$ because $p_1$ contains
more specific information and makes $p_2$ redundant. Consequently, we consider only reduced
conjunctions, i.e., those having no two different $p_1, p_2$ such that $p_1 \subseteq p_2$. Naturally, the conjunctions
must also be head-consistent, i.e., any two path queries in a conjunction must have the same
root label, since otherwise we would not be able to represent the conjunction as a twig query. By ConjPath$_0$
we denote the class of conjunctions of Boolean anchored path queries satisfying the restrictions
described above. The use of anchored path queries allows us to prove the following lemma in a
manner analogous to the proof of Lemma 3.1.

Lemma 3.2 Conjunctions of Boolean anchored path queries have the properties $P_1$ and $P_2$.

3.4 Path-subsumption-free twig queries 

As mentioned previously, our learning algorithms for twig queries attempt to construct the query
$q$ by combining the path queries from a conjunction inferred beforehand. Because we infer a
reduced set of path queries, the constructed Boolean twig query $q$ has no two $p_1, p_2 \in \mathit{Paths}(q)$
such that $p_1 \subseteq p_2$, where $\mathit{Paths}(q)$ is the set of Boolean path queries on paths from the root to
all leaves of $q$. Naturally, all path queries in $\mathit{Paths}(q)$ need to be anchored. Formally, a Boolean
twig query $q$ is path-subsumption-free iff $\mathit{Paths}(q)$ is a reduced set of Boolean anchored path
queries, and by PsfTwig$_0$ we denote the class of Boolean path-subsumption-free twig queries.

The restrictions are relaxed slightly for unary twig queries and reflect our learning algorithm,
which first infers a unary anchored path and next decorates it with elements of PsfTwig$_0$ used
as filter expressions. Recall that the selecting path in a unary twig query is the path query
on the path from the root node to the selecting node. Formally, a unary twig query $q$ is
path-subsumption-free iff the unary path query from the root node to the selecting node of $q$ is anchored
and every Boolean path query on the path ending at a (non-selecting) leaf node and beginning
at the closest node on the selecting path is anchored. By PsfTwig$_1$ we denote the class of unary
path-subsumption-free twig queries.

The classes of path-subsumption-free twig queries may seem at first very limited. We note,
however, that a twig query belongs to our class if every leaf label is different or every pair of
leaves with the same label cannot be compared with $\preceq$ (and all paths are anchored). This simple
sufficient condition yields a rather large class of twig queries used in practice, especially if we
consider the following remark. One of the advantages of considering an infinite set of labels $\Sigma$ is
the ability to capture textual values (stored in the leaves of a tree). Then, non-selecting leaves
of tree patterns are used for equality tests of text values, and the same value is rarely used for
more than one equality test (on similar paths).

Lemma 3.3 Path-subsumption-free twig queries have the properties $P_1$ and $P_2$.

We point out that in the proof of Lemma 3.3 we only use the fact that path-subsumption-free
twig queries are constructed from anchored paths. The other restriction, namely that the path
queries in $\mathit{Paths}(q)$ cannot subsume one another, is not used in this proof; it is essential
only for our learning algorithms to work properly. Our recent results show that learning is
possible without this restriction and we intend to present these findings in the journal version of
the paper.

4 Learning unary path queries 

In Figure 7 we present a learning algorithm for the learning setting AnchPath$_1$ = (Tree$_1$, AnchPath$_1$, $\mathcal{L}_1$)
inspired by and extending several learning algorithms for regular string patterns [2, 37] (cf.
Section 5 for more details). Recall that $\mathit{SelPath}(t)$ stands for the path from the root node to the
selecting node of $t$, and extend it to samples by $\mathit{SelPath}(S) = \{\mathit{SelPath}(t) \mid t \in S\}$. The algorithm
begins with the universal path query *//* and considers only the paths from the root to the selected
nodes in the input sample. It constructs the path query in three stages.

algorithm learnerAnchPath$_1$($S$)
Input: a sample $S \subseteq$ Tree$_1$ of decorated trees
Output: a minimal $p \in$ AnchPath$_1$ such that $S \subseteq \mathcal{L}_1(p)$
 1: $w := \min_{\le_{can}}(\mathit{SelPath}(S))$
 2: let $w$ be of the form $a_0/a_1/\ldots/a_n$
 3: $p$ := *//*
 4: foreach subpath $u$ of $a_1/a_2/\ldots/a_{n-1}$ in the order of decreasing lengths do
 5:     replace in $p$ any //-edge by //$u$// as long as $S \subseteq \mathcal{L}_1(p)$
 6: let $p$ be of the form $b_0//p_0//b_1$
 7: if $S \subseteq \mathcal{L}_1(p\{b_0 \leftarrow a_0\})$ then
 8:     $p := p\{b_0 \leftarrow a_0\}$
 9: if $S \subseteq \mathcal{L}_1(p\{b_1 \leftarrow a_n\})$ then
10:     $p := p\{b_1 \leftarrow a_n\}$
11: foreach descendant edge $\alpha$ in $p$ do
12:     find maximal $\ell$ s.t. $S \subseteq \mathcal{L}_1(p\{\alpha \leftarrow //(*/)^{\ell}\})$
13:     if $S \subseteq \mathcal{L}_1(p\{\alpha \leftarrow /(*/)^{\ell}\})$ then
14:         $p := p\{\alpha \leftarrow /(*/)^{\ell}\}$
15: return $p$

Figure 7: Learning algorithm for AnchPath$_1$.

In the first stage (lines 4 and 5) the algorithm attempts to identify a collection of factors,
essentially path fragments, that are common to every path in $\mathit{SelPath}(S)$. Note that
if a factor is present in every path, then it is also present in the $\le_{can}$-minimal path $w$. The
candidate query $p$ is gradually refined with the factors, and the invariant is that these factors
are present together on every path in $\mathit{SelPath}(S)$ and in the specified order. For every new
candidate $w'$, learnerAnchPath$_1$ attempts to find a place where $w'$ can be inserted to yield a
path query $p'$ consistent with $S$.

In the second stage (lines 6 through 10), the algorithm takes the query $p$ and attempts to
specialize the first and the last occurrences of the wildcard, i.e., replace them with the corresponding
symbol taken from $w$. Here, $p\{x \leftarrow e\}$ creates a copy of $p$ and replaces in it the reference $x$ by
the expression $e$ (the original $p$ remains unchanged). In the third stage (lines 11 through 14) the
algorithm attempts to specialize every //-edge in $p$, i.e., replace it with a maximally long sequence
/*/*/.../*/.
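The consistency test $S \subseteq \mathcal{L}_1(p)$ and the third stage can be sketched in Python as follows. This is only an illustration under stated assumptions, not the implementation behind Figure 7: a path query is encoded as a list of (axis, label) steps, each example is reduced to its root-to-selected path, and the helper names (parse, selects, expand, specialise_edge) as well as the two sample paths are hypothetical.

```python
import re

def parse(query):
    """'r//b/c//*' -> [(None, 'r'), ('//', 'b'), ('/', 'c'), ('//', '*')]."""
    return [(axis or None, label)
            for axis, label in re.findall(r'(//|/)?([^/]+)', query)]

def show(steps):
    return ''.join((axis or '') + label for axis, label in steps)

def selects(steps, word):
    """True iff the query maps its first step to word[0] and its last step
    to word[-1], respecting child (/) and descendant (//) edges."""
    ok = lambda label, pos: label == '*' or label == word[pos]
    reach = {0} if ok(steps[0][1], 0) else set()
    for axis, label in steps[1:]:
        if not reach:
            return False
        if axis == '/':
            cand = {j + 1 for j in reach}
        else:  # a descendant edge may jump to any later position
            cand = set(range(min(reach) + 1, len(word)))
        reach = {j for j in cand if j < len(word) and ok(label, j)}
    return len(word) - 1 in reach

def consistent(steps, sample):
    return all(selects(steps, word) for word in sample)

def expand(steps, i, first_axis, ell):
    """Replace the edge entering step i by `ell` wildcard steps; the first
    new edge uses `first_axis`, all following edges are child edges."""
    chain = [('/', '*')] * ell + [('/', steps[i][1])]
    chain[0] = (first_axis, chain[0][1])
    return steps[:i] + chain + steps[i + 1:]

def specialise_edge(steps, i, sample):
    """Stage three for the //-edge entering step i."""
    ell = 0
    while sample and consistent(expand(steps, i, '//', ell + 1), sample):
        ell += 1                       # maximal ell with //(*/)^ell consistent
    candidate = expand(steps, i, '/', ell)
    return candidate if consistent(candidate, sample) else steps

# Hypothetical selecting paths (not the trees of Figure 8).
SAMPLE = [['r', 'x', 'b', 'c', 'a'], ['r', 'y', 'b', 'c', 'd', 'c']]
P1 = parse('r//b/c//*')
print(show(specialise_edge(P1, 1, SAMPLE)))   # r/*/b/c//*
print(show(specialise_edge(P1, 3, SAMPLE)))   # r//b/c//*
```

On these two assumed paths only the top //-edge can be turned into a *-path of length 1; the bottom one stays a descendant edge because the distance between c and the selected node differs between the paths.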

Example 3 In this example we show the execution of the algorithm learnerAnchPath$_1$ on the
sample $\{t_1, t_2, t_3\}$ presented in Figure 8 together with the path queries constructed during the
execution.

Figure 8: A sample and the constructed queries.



In the first stage the algorithm identifies a factor b/c present in every selecting path and the
resulting path query is $p_0$ = *//b/c//*. There is no other common factor and the algorithm moves
to the second stage where it specializes the root node of $p_0$, obtaining $p_1$ = r//b/c//*; the
selecting node cannot be specialized because the selected nodes of $t_1$ and of another tree in the sample have two different
labels, a and c respectively. Finally, the algorithm attempts to specialize the descendant edges. Only
the top one can be replaced by a *-path of length 1, yielding $p_2$ = r/*/b/c//*, which is also the
final result of the learning algorithm. □

Some aspects of the algorithm are not fully specified, e.g., which of two different subpaths
of the same length should be chosen first in the loop in line 4. We do not enforce any
particular choice because it is inessential from the theoretical point of view (soundness and completeness),
and in practical implementations the choice could be made with the help of heuristics.

Example 4 Consider the sample consisting of the two trees: r(a(b(c(d)))) and r(b(c(a(b(d))))).
In the first stage the algorithm may identify either the factor a/b or the factor b/c but not both of them. As a
consequence the algorithm may return one of two possible queries, $p_1$ = r//a/b//d or $p_2$ = r//b/c//d.
In order to make the algorithm deterministic we may enforce some order of processing among
candidate factors of the same length, e.g., from left to right. □
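The claim of Example 4 can be checked mechanically. The sketch below assumes that the selected node in both trees is the d leaf, that the canonical order picks the shorter selecting path as $w$, and it uses a hypothetical list-of-steps encoding of path queries; none of these names come from the paper.

```python
import re

def parse(query):
    """'r//a/b//d' -> [(None, 'r'), ('//', 'a'), ('/', 'b'), ('//', 'd')]."""
    return [(axis or None, label)
            for axis, label in re.findall(r'(//|/)?([^/]+)', query)]

def selects(steps, word):
    """First step on word[0], last step on word[-1]; '/' moves to the next
    position, '//' to any later position."""
    ok = lambda label, pos: label == '*' or label == word[pos]
    reach = {0} if ok(steps[0][1], 0) else set()
    for axis, label in steps[1:]:
        if not reach:
            return False
        cand = ({j + 1 for j in reach} if axis == '/'
                else set(range(min(reach) + 1, len(word))))
        reach = {j for j in cand if j < len(word) and ok(label, j)}
    return len(word) - 1 in reach

# Selecting paths of r(a(b(c(d)))) and r(b(c(a(b(d))))), assuming d is selected.
W1 = ['r', 'a', 'b', 'c', 'd']
W2 = ['r', 'b', 'c', 'a', 'b', 'd']

def consistent(query):
    return all(selects(parse(query), w) for w in (W1, W2))

# Candidate factors are the subpaths of the inner part a/b/c of W1.
inner = W1[1:-1]
factors = ['/'.join(inner[i:j])
           for i in range(len(inner)) for j in range(i + 1, len(inner) + 1)]
common = [u for u in factors if consistent(f'r//{u}//d')]
longest = [u for u in common
           if u.count('/') == max(v.count('/') for v in common)]

print(longest)                         # ['a/b', 'b/c']
print(consistent('r//a/b//b/c//d'))    # False
print(consistent('r//b/c//a/b//d'))    # False
```

Both factors of length two are common to the two paths, but they occur in opposite orders, so no query containing both of them is consistent with the sample.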

We observe that $S \subseteq \mathcal{L}_1(p)$ is an invariant maintained throughout learnerAnchPath$_1$ and with
a simple analysis one can show that learnerAnchPath$_1$ is sound for AnchPath$_1$. But what makes
this algorithm particularly interesting is the following.

Lemma 4.1 The algorithm learnerAnchPath$_1$ returns a minimal anchored path query consistent
with the input sample.

We prove the claim below, which by $P_1$ for AnchPath$_1$ is equivalent to the lemma above.

Claim 2 If learnerAnchPath$_1$($S$) returns $p$, then there is no unary anchored path query $q \neq p$
such that $q \preceq p$ and $S \subseteq \mathcal{L}_1(q)$.
Proof Suppose otherwise and take a unary anchored path query $q \preceq p$ having an embedding
$\lambda : p \hookrightarrow q$. We note that $q$ can be viewed as the result of applying a substitution $\theta$, i.e., $q = p\theta$, which
substitutes in $p$ some of the labels of $*$-nodes with labels in $\Sigma$ and replaces some of the //-edges
with path queries. This substitution can be decomposed into a composition $\theta = \theta_1 \circ \theta_2 \circ \ldots \circ \theta_k$
of atomic operations, which for brevity we present here as rewriting rules:
1) $// \mapsto /$ replacing a //-edge with a child edge,
2) $// \mapsto //B//$ replacing a //-edge with a block $B$ (cf. proof of Claim 1),
3) $* \mapsto a$ changing the $*$ label of a node to some $a \in \Sigma$,
4) $// \mapsto /*/*/\ldots/*/$ replacing a //-edge with a $*$-path.
Now, if we take the path query $q' = p\theta_1$, then $q \preceq q' \preceq p$ and $q' \neq p$. Consequently,
it suffices to assume that $q$ is obtained from $p$ by applying just one atomic substitution $\theta^{\circ}$.

Essentially, the first three types of atomic operations allow us to identify a new factor or a longer
factor that would have been discovered and properly incorporated into the resulting query during
the execution of learnerAnchPath$_1$($S$) in lines 4-5. The last type, $// \mapsto /*/\ldots/*/$, allows us to identify
a descendant edge that would have been converted to a $*$-path in lines 11-14. These arguments
show that $p$ could not have been the result of learnerAnchPath$_1$($S$); a contradiction. □

We argue that Lemmas 3.1 and 4.1 imply completeness of learnerAnchPath$_1$ w.r.t. AnchPath$_1$.
Indeed, if $CS_q \subseteq S$ and learnerAnchPath$_1$($S$) returns $p$, then $q \subseteq p$ because $CS_q \subseteq \mathcal{L}_1(p)$, but
there is no query $q' \subsetneq p$ such that $S \subseteq \mathcal{L}_1(q')$. Hence, $q$ and $p$ are equivalent.

Theorem 4.2 Anchored path queries are learnable in polynomial time and data from positive
examples (i.e., in the setting AnchPath$_1$).

5 Learning Boolean path queries 

Learning Boolean path queries is more challenging than learning unary path queries. In decorated
trees, which are examples for learning unary queries, the selected nodes unambiguously indicate
the path to be matched by the query. The examples for learning Boolean path queries are trees
with no indication of the path the constructed query should match. To address this problem we
devise an algorithm that infers a conjunction of Boolean anchored path queries that are satisfied
in the given sample. Recall that AnchPath$_0$ is the class of Boolean anchored path queries
and ConjPath$_0$ is the class of reduced and head-consistent conjunctions of Boolean anchored
path queries (represented as sets of Boolean path queries). The corresponding learning settings
are AnchPath$_0$ = (Tree$_0$, AnchPath$_0$, $\mathcal{L}_0$) and ConjPath$_0$ = (Tree$_0$, ConjPath$_0$, $\mathcal{L}_0$), where $\mathcal{L}_0$
interprets a set of path queries $P$ as the twig query obtained by gluing the root nodes together.

Figure 9 contains the learning algorithms for ConjPath$_0$ and AnchPath$_0$. First, we introduce
learnerAnchPath$_0^*$, a helper learner derived from learnerAnchPath$_1$, which infers a minimal Boolean
anchored path query that is satisfied by the given path $u$ and every tree in the input sample.
Note that to ensure that the output is a Boolean anchored query, learnerAnchPath$_0^*$ skips the
specialization of the last //-edge if doing so would yield a query that is not anchored (i.e., ending
with * not preceded immediately by //). The purpose of taking the initial path $u$ from the input
is the ability to consider every path in $S$ as the word in which to search for common factors.

Essentially, learnerConjPath$_0$ considers every path $u$ in every tree of $S$ and uses learnerAnchPath$_0^*$
to find a most specific (i.e., minimal) Boolean path query $p$ satisfied by $u$ and every other
element of $S$. The set $P$ aggregates all minimal results of running learnerAnchPath$_0^*$ over all
paths in the input sample. The learning algorithm learnerAnchPath$_0$ simply takes the result of
learnerConjPath$_0$ and chooses one element. The choice is arbitrary, but later we show that in
the presence of the characteristic sample learnerConjPath$_0$ returns a singleton and there is no
ambiguity.

algorithm learnerAnchPath$_0^*$($u$, $S$)
Input: a path $u$ and a sample $S \subseteq$ Tree$_0$ of trees
Output: a minimal $p \in$ AnchPath$_0$ s.t. $S \cup \{u\} \subseteq \mathcal{L}_0(p)$
This algorithm is obtained from learnerAnchPath$_1$ by:
  • initializing $w$ to $u$ (line 1)
  • replacing every $S \subseteq \mathcal{L}_1(p)$ by $S \cup \{w\} \subseteq \mathcal{L}_0(p)$
  • skipping lines 12-14 for the last //-edge if $b_1 = *$.

algorithm learnerConjPath$_0$($S$)
Input: a sample $S \subseteq$ Tree$_0$ of trees
Output: a set of minimal queries $P \subseteq$ AnchPath$_0$ such that $S \subseteq \mathcal{L}_0(P)$
 1: $P := \emptyset$
 2: for $u \in \mathit{Paths}(S)$ do
 3:     $p$ := learnerAnchPath$_0^*$($u$, $S$)
 4:     if $\nexists q \in P.\ q \preceq p$ then
 5:         $P := P \setminus \{q \in P \mid p \preceq q\}$
 6:         $P := P \cup \{p\}$
 7: return $P$

algorithm learnerAnchPath$_0$($S$)
Input: a sample $S \subseteq$ Tree$_0$ of trees
Output: a minimal $p \in$ AnchPath$_0$ such that $S \subseteq \mathcal{L}_0(p)$
 1: $P$ := learnerConjPath$_0$($S$)
 2: choose any $p$ from $P$
 3: return $p$

Figure 9: Learning ConjPath$_0$ and AnchPath$_0$.

Example 5 We run learnerConjPath$_0$ on the sample $S_0$ (Figure 10) corresponding to the positive
examples from Example 5 simplified for clarity of presentation. The set of paths $\mathit{Paths}(S_0)$ in
the sample consists of:

$u_1$ = offer/item/for-sale,
$u_2$ = offer/item/descr,
$u_3$ = offer/list/item/for-sale,
$u_4$ = offer/list/item/descr,
$u_5$ = offer/list/item/wanted.

Running learnerAnchPath$_0^*$ on those paths yields:

learnerAnchPath$_0^*$($u_1$, $S_0$) = offer//item/for-sale,
learnerAnchPath$_0^*$($u_2$, $S_0$) = offer//item/descr,
learnerAnchPath$_0^*$($u_3$, $S_0$) = offer//item/for-sale,
learnerAnchPath$_0^*$($u_4$, $S_0$) = offer//item/descr,
learnerAnchPath$_0^*$($u_5$, $S_0$) = offer//item//*.
Figure 10: Input sample from Example 5.



Note that the result of learnerAnchPath$_0^*$ on $u_5$ is the Boolean anchored query offer//item//*
and not the more specific offer//item/*, because the latter is not anchored; learnerAnchPath$_0^*$ skips the
attempt to specialize the last //-edge because it is followed by *. The query offer//item//*
is, however, subsumed by all the previous queries, and therefore, learnerConjPath$_0$($S_0$) returns
a set containing only the queries offer//item/for-sale and offer//item/descr. The run of
learnerAnchPath$_0$ on $S_0$ returns one of those queries, e.g., the one whose string representation
is lexicographically minimal, offer//item/descr. While this is not the best choice for Example 2,
the negative examples can be used in a heuristic to select a query rejecting the most negative
examples, in this case offer//item/for-sale. □
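The filtering performed by lines 4-6 of learnerConjPath$_0$ can be illustrated on the five results listed in Example 5. The sketch below is a hedged illustration, not the authors' code: it assumes that $p \preceq q$ holds exactly when $q$ embeds into $p$ (root to root, child edges onto child edges, descendant edges onto downward paths, non-wildcard labels preserved), and the function names embeds and add_reduced are hypothetical.

```python
import re

def parse(query):
    """'offer//item/descr' -> [(None, 'offer'), ('//', 'item'), ('/', 'descr')]."""
    return [(axis or None, label)
            for axis, label in re.findall(r'(//|/)?([^/]+)', query)]

def embeds(q, p):
    """True iff the Boolean path query q embeds into p, i.e. p is at least
    as specific as q (p ⪯ q in the notation of the text)."""
    def place(i, j):                      # node i of q is mapped to node j of p
        if i == len(q) - 1:
            return True
        axis, label = q[i + 1]
        if axis == '/':                   # a child edge must hit a child edge
            targets = [j + 1] if j + 1 < len(p) and p[j + 1][0] == '/' else []
        else:                             # a descendant edge may skip nodes
            targets = range(j + 1, len(p))
        return any((label == '*' or p[k][1] == label) and place(i + 1, k)
                   for k in targets)
    root_ok = q[0][1] == '*' or q[0][1] == p[0][1]
    return root_ok and place(0, 0)

def add_reduced(P, p):
    """Lines 4-6 of learnerConjPath: keep only the most specific queries."""
    if any(embeds(p, q) for q in P):      # some q ⪯ p is already in P
        return P
    return [q for q in P if not embeds(q, p)] + [p]   # drop every q with p ⪯ q

# The five results of the helper learner reported in Example 5.
results = ['offer//item/for-sale', 'offer//item/descr',
           'offer//item/for-sale', 'offer//item/descr', 'offer//item//*']

P = []
for r in results:
    P = add_reduced(P, parse(r))
print(P == [parse('offer//item/for-sale'), parse('offer//item/descr')])  # True
```

The less specific query offer//item//* is discarded regardless of the order in which the results arrive, since the filter also removes previously stored queries that a new, more specific query makes redundant.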

Because learnerConjPath$_0$($S$) returns a set $P$ of Boolean path queries that are satisfied in
every tree in $S$, this algorithm is sound. Naturally, learnerAnchPath$_0$ is also sound because it
returns one element of $P$. To show completeness of both learning algorithms, we point out an
important property of learnerAnchPath$_0^*$. The construction of characteristic samples $CS_p$ and
$CS_P$ is in Section [XT].

Lemma 5.1 Take a conjunctive query $P \in$ ConjPath$_0$, let $CS_P = \{t_0, t_1\}$ be the characteristic
sample for $P$, and take any sample $S \subseteq \mathcal{L}_0(P)$ containing two examples $t'_0, t'_1$ such that
$\mathit{Paths}(t_i) = \mathit{Paths}(t'_i)$ for $i \in \{0, 1\}$. Then,

1. for every $u \in \mathit{Paths}(S)$, learnerAnchPath$_0^*$($u$, $S$) returns a path query equal to or subsumed
by some $p \in P$.

2. for every $p \in P$ there exists $u \in \mathit{Paths}(S)$ such that learnerAnchPath$_0^*$($u$, $S$) returns $p$.

The above result shows completeness of learnerConjPath$_0$. As for learnerAnchPath$_0$, if we take a
Boolean anchored path query $p$ and apply the previous lemma to $P = \{p\}$, we get that for any
sample $S$ consistent with $p$ and containing $CS_p$ the algorithm learnerConjPath$_0$($S$) returns the
singleton $\{p\}$, and thus learnerAnchPath$_0$($S$) returns $p$. This result allows us to prove learnability
of both classes of queries.

Theorem 5.2 The query classes ConjPath$_0$ and AnchPath$_0$ are learnable in polynomial time
and data from positive examples (i.e., in the settings ConjPath$_0$ and AnchPath$_0$ resp.).

We also show minimality of learnerAnchPath$_0$.

Lemma 5.3 For any finite $S \subseteq$ Tree$_0$, learnerConjPath$_0$($S$) returns a set of minimal Boolean
anchored path queries consistent with $S$ and learnerAnchPath$_0$($S$) returns a minimal Boolean
anchored path query consistent with $S$.

We point out that while the result of learnerConjPath$_0$($S$) is a set of minimal queries, it is not
necessarily a minimal conjunctive query, i.e., it is not a maximal set of minimal queries. In the
example below we show that a set of positive examples may have an exponential number of
minimal Boolean path queries, and therefore, constructing their conjunction cannot be done in
polynomial time.
Example 6 Fix $n > 0$ and take the set of positive examples $S_{exp}$ containing exactly two trees

$t_0 = r(a_1(b_1(\ldots a_n(b_n(c)) \ldots)))$,
$t_1 = r(b_1(a_1(\ldots b_n(a_n(c)) \ldots)))$.

Any query of the form $r//\beta_1//\ldots//\beta_n//c$, with $\beta_i \in \{a_i, b_i\}$ and $i \in \{1, \ldots, n\}$, is a minimal
Boolean path query consistent with $S_{exp}$. □
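The exponential count in Example 6 is easy to reproduce for small $n$. The sketch below only checks consistency of the $2^n$ candidate queries with the two trees (minimality is not verified); the encoding of linear trees as label lists and the function names are assumptions made for this illustration.

```python
from itertools import product

def satisfied(query, path):
    """Boolean path query with //-edges only, evaluated on a linear tree:
    the root labels must agree and the remaining query labels must occur
    along the path in the same order (a subsequence test)."""
    if query[0] != path[0]:
        return False
    rest = iter(path[1:])
    return all(label in rest for label in query[1:])

def sample(n):
    """The two trees t0 and t1 of Example 6, written as label sequences."""
    t0 = ['r'] + [x for i in range(1, n + 1) for x in (f'a{i}', f'b{i}')] + ['c']
    t1 = ['r'] + [x for i in range(1, n + 1) for x in (f'b{i}', f'a{i}')] + ['c']
    return t0, t1

def candidate_queries(n):
    """All queries r//beta_1//...//beta_n//c with beta_i in {a_i, b_i}."""
    for choice in product(*[(f'a{i}', f'b{i}') for i in range(1, n + 1)]):
        yield ['r', *choice, 'c']

n = 3
consistent = [q for q in candidate_queries(n)
              if all(satisfied(q, t) for t in sample(n))]
print(len(consistent))                                   # 8
# Asking for both a1 and b1 fails: they occur in opposite orders in t0 and t1.
print(all(satisfied(['r', 'a1', 'b1', 'c'], t) for t in sample(n)))  # False
```

Each position $i$ forces a choice between $a_i$ and $b_i$, which is where the $2^n$ pairwise incomparable queries come from.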



6 Learning Boolean twig queries 

In this section we investigate learning path-subsumption-free twig queries from positive examples,
i.e., the learning setting PsfTwig$_0$ = (Tree$_0$, PsfTwig$_0$, $\mathcal{L}_0$). Recall that $q \in$ PsfTwig$_0$ is a query
such that the set of root-to-leaf paths $\mathit{Paths}(q)$ consists of Boolean anchored path queries and
does not contain two path queries such that one subsumes the other. Our approach is based on the
algorithm learnerConjPath$_0$, which infers a set $P$ of minimal Boolean path queries, and a method
that allows us to reconstruct a twig query from the path queries in $P$. Intuitively speaking, we
interleave the path queries from $P$ to obtain the twig query. Below, we describe this technique
formally.

Given a path query $p$ and a node $n \in N_p$, the split of $p$ at $n$ is a pair of path queries $p_1$ and
$p_2$ such that $p_1$ is the path from $\mathit{root}_p$ to $n$ and $p_2$ is the path from $n$ to the only leaf of $p$. Note
that $n$ becomes the root node of $p_2$. A fusion of $p$ into a twig query $q$ is a twig query $q'$ such that,
for some node $n$ of $p$, the pair $p_1$ and $p_2$ is a split of $p$ at $n$, there exists an embedding $\lambda : p_1 \hookrightarrow q$, and $q'$ is obtained
from $q$ by attaching $p_2$ at node $\lambda(n)$ (the node $\lambda(n)$ and the root node $n$ of $p_2$ become the same
node; the label of $n$ in $p_2$ is ignored). By $\mathit{Fusions}(p, q)$ we denote the set of all fusions of $p$ into
$q$. Figure 11 presents all fusions of r//a/b into r[*/a]//a/c.
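One possible reading of the fusion operator can be sketched as follows. The encoding is an assumption made for illustration: a twig query is a nested tuple (label, ((axis, child), ...)), a path query is a list of (axis, label) steps, and nodes of the twig are addressed by tuples of child indices. The functions images, attach, and fusions are hypothetical names, and the enumeration follows the definition above as understood here; Figure 11 remains the authoritative list of fusions.

```python
def node_at(q, addr):
    for i in addr:
        q = q[1][i][1]
    return q

def descendants(q, addr):
    for i in range(len(node_at(q, addr)[1])):
        yield addr + (i,)
        yield from descendants(q, addr + (i,))

def images(prefix, q):
    """Addresses in q onto which an embedding of the path `prefix` (root on
    root) can map the last node of the prefix."""
    match = lambda plabel, qlabel: plabel == '*' or plabel == qlabel
    if not match(prefix[0][1], q[0]):
        return set()
    current = {()}
    for axis, label in prefix[1:]:
        nxt = set()
        for addr in current:
            if axis == '/':
                cands = [addr + (i,) for i, (ax, _) in
                         enumerate(node_at(q, addr)[1]) if ax == '/']
            else:
                cands = descendants(q, addr)
            nxt |= {c for c in cands if match(label, node_at(q, c)[0])}
        current = nxt
    return current

def chain(steps):
    """[(axis, label), ...] -> a single (axis, node) branch."""
    axis, label = steps[0]
    return (axis, (label, (chain(steps[1:]),) if len(steps) > 1 else ()))

def attach(q, addr, branch):
    label, kids = q
    if not addr:
        return (label, kids + (branch,))
    i, (ax, child) = addr[0], kids[addr[0]]
    return (label, kids[:i] + ((ax, attach(child, addr[1:], branch)),) + kids[i + 1:])

def fusions(p, q):
    result = set()
    for k in range(len(p)):                    # split p at its k-th node
        for addr in images(p[:k + 1], q):
            tail = p[k + 1:]
            result.add(attach(q, addr, chain(tail)) if tail else q)
    return result

def show(q):
    """Render every branch as a filter; './/' marks a descendant edge."""
    return q[0] + ''.join('[' + ('.//' if ax == '//' else '') + show(c) + ']'
                          for ax, c in q[1])

q0 = ('r', (('/', ('*', (('/', ('a', ())),))), ('//', ('a', (('/', ('c', ())),)))))
p0 = [(None, 'r'), ('//', 'a'), ('/', 'b')]
for f in sorted(show(f) for f in fusions(p0, q0)):
    print(f)
```

Under this reading the sketch enumerates three candidates for r//a/b and r[*/a]//a/c: the remaining steps are attached at the root, below the a under the wildcard, or below the a reached by the descendant edge. It also returns the empty set when the root labels differ, as for a/a and b[a]/b.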
Figure 11: Fusions of $p_0$ into $q_0$.



We point out that if $q$ is path-subsumption-free and $p$ is anchored, then all elements of
$\mathit{Fusions}(p, q)$ are path-subsumption-free. We note that $\mathit{Fusions}(p, q)$ may be empty, e.g., there is
no fusion of a/a into b[a]/b, but as we argue next, this is never the case in the learning algorithm
learnerPsfTwig$_0$, which we present in Figure 12. We slightly extend the notation: $\emptyset$ denotes a
phantom empty twig query and $\mathit{Fusions}(p, \emptyset) = \{p\}$.

Basically, learnerPsfTwig$_0$ uses learnerConjPath$_0$ to construct a set $P$ of Boolean path queries
satisfied in all trees of $S$ and then fuses all the paths into one twig query. Note that $C$ is never
empty because $q$ is built up from path queries in $P$ that are satisfied in $S$ and have the same
label in their root nodes. Consequently, learnerPsfTwig$_0$ executes without errors and is sound.
The order in which learnerPsfTwig$_0$ performs fusions is arbitrary, but later on we show that in
the presence of the characteristic sample, the set $C$ has exactly one element at all times, and the
final result is the goal query. First, we illustrate the work of learnerPsfTwig$_0$ on an example.
algorithm learnerPsfTwig$_0$($S$)
Input: a sample $S \subseteq$ Tree$_0$ of trees
Output: a query $q \in$ PsfTwig$_0$ such that $S \subseteq \mathcal{L}_0(q)$
 1: $q := \emptyset$
 2: $P$ := learnerConjPath$_0$($S$)
 3: for $p \in P$ do
 4:     $C := \{q' \in \mathit{Fusions}(p, q) \mid S \subseteq \mathcal{L}_0(q')\}$
 5:     $q$ := choose any $\preceq$-minimal element of $C$
 6: return $q$

Figure 12: Learning algorithm for PsfTwig$_0$.
Figure 13: Input sample.



Example 7 Consider a sample $S_1$ containing two DBLP listings in Figure 13, one with a
collection of articles and the other with a collection of books.
learnerConjPath$_0$($S_1$) returns the following path queries:

$p_1$ = dblp/*/author, $p_2$ = dblp/*/title, $p_3$ = dblp/*/url.

We perform fusions in the order $p_1$, $p_2$, and $p_3$. Fusing $p_1$ and $p_2$ yields the query dblp/*[title]/author
and fusing $p_3$ into it gives $q'$ = dblp[*/url]/*[title]/author. Note that in the last step,
dblp/*[title][url]/author is one of the fusions but it is not consistent with the input sample
$S_1$. On the other hand, if the order of fusions is $p_2$, $p_3$, and $p_1$, then the end result is
$q''$ = dblp[*/author]/*[title]/url. □

While in the previous example the queries $q'$ and $q''$ are minimal path-subsumption-free twig
queries consistent with $S_1$, in general learnerPsfTwig$_0$ does not need to produce such minimal
queries. In fact, we show that for certain samples such a minimal query may be of exponential
size and thus impossible to construct by a polynomial algorithm.

Example 8 (cont'd Example 6) Recall the sample $S_{exp}$ and observe that the minimal twig
query consistent with $S_{exp}$ has the shape of a perfect binary tree of height $n + 1$ where every
node at depth $i \in \{0, \ldots, n - 1\}$ has two children labeled with $a_{i+1}$ and $b_{i+1}$ (connected with
their parent with a //-edge). Naturally, this minimal query is path-subsumption-free. □

Now, we move to completeness of learnerPsfTwig$_0$ and we fix a query $q \in$ PsfTwig$_0$ and a sample
$S \subseteq \mathcal{L}_0(q)$. Recall the construction of the characteristic sample $CS_q$ for $q$ from Section 3.1. First,
we observe that for $q \in$ PsfTwig$_0$ every $p \in \mathit{Paths}(q)$ is a $\preceq$-minimal element of $\mathit{Paths}(q)$. As a
simple consequence of Lemma 5.1 we get the following.

Lemma 6.1 If $S$ contains $CS_q$, then learnerConjPath$_0$($S$) returns $\mathit{Paths}(q)$.

To state that the algorithm approaches the goal query $q$ with every fusion, we need to define
formally the search space of subqueries of $q$ and show that when moving with the fusion operator
we never leave the space and finally reach $q$. A Boolean twig query $q'$ is a subquery of $q$ if there
exists a subset $N$ of leaves of $q$ such that $q'$ is a subgraph induced by the set of paths from the
root of $q$ to the leaves in $N$. The main claim follows.

Lemma 6.2 Assume that $CS_q \subseteq S$. For any subquery $q'$ of $q$, and any path query $p \in \mathit{Paths}(q) \setminus
\mathit{Paths}(q')$, the set of elements of $\mathit{Fusions}(p, q')$ consistent with $S$ has exactly one $\preceq$-minimal
element $q''$. Furthermore, $q''$ is a subquery of $q$.

If $CS_q \subseteq S$, then by Lemma 6.1 $P = \mathit{Paths}(q)$, and therefore, whatever the order of choosing
paths from $P$ in line 3, the algorithm learnerPsfTwig$_0$ approaches $q$, and when all paths in $P$ are
fused, we obtain $q$.

Theorem 6.3 Path-subsumption-free Boolean twig queries are learnable in polynomial time and
data from positive examples (i.e., in the setting PsfTwig$_0$).

7 Learning unary twig queries 

In this section, we present an algorithm learnerPsfTwig$_1$ (Figure 14) for learning unary
path-subsumption-free twig queries from positive examples, i.e., in the learning setting PsfTwig$_1$ =
(Tree$_1$, PsfTwig$_1$, $\mathcal{L}_1$).

Essentially, the learning algorithm uses learnerAnchPath$_1$ to construct a path query $p$ and then
it uses learnerPsfTwig$_1^*$, a helper learner derived from learnerPsfTwig$_0$, to decorate the nodes of
the path query with filter expressions (Boolean twig queries). Here, we use the non-abbreviated
syntax of XPath to represent the path query $p$ as $\ell_0/\alpha_1{::}\ell_1/\ldots/\alpha_k{::}\ell_k$, where $\ell_i \in \Sigma \cup \{*\}$ and
$\alpha_i$ is either child or descendant.

When decorating the $i$-th step of $p$, i.e., the fragment $\alpha_i{::}\ell_i$, with a filter expression, the
algorithm first constructs a sample $S_i$ of subtrees that serve as positive examples for learning the
corresponding filter expression. From every decorated tree in the input sample $S$ one subtree is
extracted. Each subtree is rooted at a node $n$ on the path from the root node to the selected node
of the decorated tree $t$. The choice of $n$ is made so that it can be reached with the unprocessed
part of the path query $\ell_0/\alpha_1{::}\ell_1/\ldots/\alpha_i{::}\ell_i$ and at the same time the decorated part of the path
query $q'_i$ selects the selected node $\mathit{sel}_t$ when evaluated from $n$. An important invariant of the
outer for loop (lines 4-11) is that there is at least one such $n$ for every $t \in S$. If there is more
than one possible choice, the deepest node is chosen.

Example 9 Consider a sample $S_2$ (Figure 15) that contains the positive examples corresponding
to (a simplified version of) the document from Example 1. learnerAnchPath$_1$($S_2$) returns the query
$p$ = /library/*/title. The algorithm attempts to specialize the bottom fragment $q'_2$ = title
using the two subtrees title(capital) and title(manifesto). The only Boolean anchored
path query these subtrees have in common is title//*, which is fused into the query yielding
$q_2$ = title[.//*]. Next, the algorithm moves to $q'_1$ = */title[.//*] and calls learnerPsfTwig$_1^*$ with
two subtrees: one rooted at the node collection and one rooted at the node book. learnerConjPath$_0$
called with these two trees on input returns two path queries */title//* and */author/marx.
The first path query is subsumed by $q'_1$, and therefore, it is absorbed by $q'_1$ when fusing. Fusing
the second path query into $q'_1$ yields the query $q_1$ = *[author/marx]/title[.//*]. Finally, the
algorithm moves one level up to the query $q'_0 = q_0$ = library/*[author/marx]/title[.//*], which is
also the end result of learnerPsfTwig$_1$. □

We observe that $q_0$ can be considered overspecialized: it contains the filter expression [.//*],
which tests that the selected title nodes have contents, a test trivially true in the presence of
reasonable schema information. Currently, however, our algorithms do not take advantage of
schema information.

algorithm learnerPsfTwig$_1^*$($S$, $q'$)
Input: a sample $S \subseteq$ Tree$_1$ of decorated trees and
       a query $q' \in$ PsfTwig$_1$ such that $S \subseteq \mathcal{L}_1(q')$
Output: a query $q \in$ PsfTwig$_1$ s.t. $q \preceq q'$ and $S \subseteq \mathcal{L}_1(q)$
This algorithm is obtained from learnerPsfTwig$_0$ by:
  • initializing $q$ to $q'$ (line 1)
  • replacing every $\mathcal{L}_0$ by $\mathcal{L}_1$

algorithm learnerPsfTwig$_1$($S$)
Input: a sample $S$ of decorated trees
Output: a query $q \in$ PsfTwig$_1$ such that $S \subseteq \mathcal{L}_1(q)$
 1: $p$ := learnerAnchPath$_1$($S$)
 2: let $p$ be of the form $\ell_0/\alpha_1{::}\ell_1/\ldots/\alpha_k{::}\ell_k$
 3: $q'_k := \ell_k$
 4: for $i = k, \ldots, 0$ do
 5:     $S_i := \emptyset$
 6:     for $t \in S$ do
 7:         let $n$ be the deepest node on the path from the root node $\mathit{root}_t$ to the selected
            node $\mathit{sel}_t$ such that $n$ is reachable from $\mathit{root}_t$ with $\ell_0/\alpha_1{::}\ell_1/\ldots/\alpha_i{::}\ell_i$
            and $\mathit{sel}_t$ is reachable from $n$ with $q'_i$
 8:         add the subtree of $t$ rooted at $n$ to $S_i$
 9:     $q_i$ := learnerPsfTwig$_1^*$($S_i$, $q'_i$)
10:     if $i > 0$ then
11:         $q'_{i-1} := \ell_{i-1}/\alpha_i{::}q_i$
12: return $q_0$

Figure 14: Learning algorithm for PsfTwig$_1$.

The soundness of learnerPsfTwig$_1$ follows from the invariant of the main loop (lines 4-11):
for every $t \in S$ in line 7 there is at least one node with the desired property. Completeness
of learnerPsfTwig$_1$ follows essentially from completeness of the algorithms learnerAnchPath$_1$ and
learnerPsfTwig$_0$, and from the fact that in line 7 we choose the deepest node.

Theorem 7.1 Path-subsumption-free unary twig queries are learnable in polynomial time and
data from positive examples (i.e., in the setting PsfTwig$_1$).

8 Impact of negative examples 

In the previous sections, we considered the setting where the user provides positive examples
only. In this section, we allow the user to additionally specify negative examples. We use
two symbols $+$ and $-$ to mark whether an example $t$ of some query is a positive one $(t, +)$
or a negative one $(t, -)$. Formally, for $i \in \{0, 1\}$ we consider the following learning settings:
Path$_i^{\pm}$ = (Tree$_i^{\pm}$, Path$_i$, $\mathcal{L}_i^{\pm}$) and Twig$_i^{\pm}$ = (Tree$_i^{\pm}$, Twig$_i$, $\mathcal{L}_i^{\pm}$), where Tree$_i^{\pm}$ = Tree$_i \times \{+, -\}$
and $\mathcal{L}_i^{\pm}(q) = \mathcal{L}_i(q) \times \{+\} \cup (\mathrm{Tree}_i \setminus \mathcal{L}_i(q)) \times \{-\}$.
Figure 15: Examples from a library database.

We study the problem of checking whether there even exists a query consistent with the input
sample, because any sound learning algorithm needs to return Null if and only if there is no
such query. Formally, given a learning setting $\mathcal{K} = (\mathcal{D}, \mathcal{C}, \mathcal{L})$, the $\mathcal{K}$-consistency is the following
decision problem

$\mathrm{CONS}_{\mathcal{K}} = \{S \subseteq \mathcal{D} \mid \exists q \in \mathcal{C}.\ S \subseteq \mathcal{L}(q)\}$.

Note that in the presence of positive examples only, the consistency problem is trivial as long as the
query class contains the universal query *//*. In the presence of negative examples this problem
becomes considerably harder.

Theorem 8.1 Twig$_i^{\pm}$-consistency is NP-complete for any $i \in \{0, 1\}$ (even in the presence of one
negative example).

Proof We only outline the proof of NP-hardness of Twig$_0^{\pm}$-consistency with a reduction from
SAT. Showing the membership in NP is more difficult, uses a nontrivial minimal-witness
argument, and is omitted.

We illustrate the reduction on an example of a CNF formula $\varphi_0 = (\neg x_1 \vee x_2 \vee \neg x_3) \wedge (x_1 \vee \neg x_2)$
for which the corresponding sample is presented in Figure 16 (positive and negative examples
are indicated with the symbols $+$ and $-$ respectively).

Figure 16: Reduction of SAT to Twig$_0^{\pm}$-consistency for $\varphi_0 = (\neg x_1 \vee x_2 \vee \neg x_3) \wedge (x_1 \vee \neg x_2)$.

The building block of the reduction is a brush tree, which is used to encode Boolean valuations
and constraints on them. For instance, for the set of variables $\{x_1, x_2, x_3\}$ the full brush tree is
$d(x_1(0, 1), x_2(0, 1), x_3(0, 1))$ but typically we remove some of the leaves. For instance, the
valuation $V_0 = \{(x_1, \mathit{false}), (x_2, \mathit{false}), (x_3, \mathit{true})\}$ is represented by the tree $t_0 = d(x_1(0), x_2(0), x_3(1))$.
Note that the tree pattern $c(t_0)$ separates the positive examples from the negative ones in
Figure 16 because $V_0$ satisfies $\varphi_0$.

The constructed set of examples consists of several c-trees. The positive c-trees specify the 
satisfying valuations of the input CNF formula; there is one c-tree per clause of the input formula. 
Each c-tree contains one brush tree per literal of the clause, every brush tree encoding the 
valuations that satisfy the corresponding literal (one leaf removed). The negative c-tree ensures 
that a brush filter that separates the positive examples from the negative ones is well-formed and encodes
a valuation. This c-tree contains one brush tree per variable of the input formula; every such brush
tree has both leaves of the corresponding variable $x_i$ removed. We claim that this set of examples
is consistent if and only if the input CNF formula is satisfiable. The if part is trivial and the 
proof of the only if part is technical and uses the observation that the depth of any twig query 
separating the positive examples from negative ones is bounded by 4. □ 

The result holds even for very limited query classes that do not use //-edges and *, and in
particular it holds for path-subsumption-free twig queries.
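
The easy direction of the reduction can be checked mechanically on the example formula. The following is a minimal sketch under stated assumptions, not the construction as implemented by the authors: trees are encoded as (label, children) pairs, queries are twig patterns without //-edges and * that are matched root to root, and all function names (`embeds`, `brush`, `build_sample`, and so on) are hypothetical. It only verifies that the pattern built from a valuation separates the sample exactly when the valuation satisfies the formula; it says nothing about the technical only-if part of the proof.

```python
from itertools import product

def node(label, *children):
    """A tree is a pair (label, children), where children is a tuple of trees."""
    return (label, tuple(children))

def embeds(pattern, tree):
    """Root-to-root embedding for patterns without //-edges and *: labels agree and
    every pattern child embeds into some tree child (not necessarily injectively)."""
    (p_label, p_children), (t_label, t_children) = pattern, tree
    return p_label == t_label and all(
        any(embeds(pc, tc) for tc in t_children) for pc in p_children
    )

def brush(variables, leaves):
    """Brush tree d(x1(...), ..., xn(...)); leaves maps each variable to its kept leaves."""
    return node("d", *(node(x, *(node(b) for b in leaves[x])) for x in variables))

def build_sample(variables, clauses):
    """clauses is a list of clauses, each a list of (variable, polarity) literals."""
    full = {x: ("0", "1") for x in variables}
    positives = []
    for clause in clauses:
        brushes = []
        for var, polarity in clause:
            # Keep only the leaf that makes the literal true (one leaf removed).
            kept = {**full, var: ("1",) if polarity else ("0",)}
            brushes.append(brush(variables, kept))
        positives.append(node("c", *brushes))
    # Negative c-tree: one brush per variable, with both leaves of that variable removed.
    negative = node("c", *(brush(variables, {**full, x: ()}) for x in variables))
    return positives, negative

def valuation_query(variables, valuation):
    """Twig pattern c(t_V) for a valuation V given as a dict from variables to booleans."""
    kept = {x: ("1",) if valuation[x] else ("0",) for x in variables}
    return node("c", brush(variables, kept))

def satisfies(valuation, clauses):
    return all(any(valuation[v] == pol for v, pol in clause) for clause in clauses)

def separates(query, positives, negative):
    return all(embeds(query, t) for t in positives) and not embeds(query, negative)

VARS = ("x1", "x2", "x3")
# (not x1 or x2 or not x3) and (x1 or not x2)
PHI0 = [[("x1", False), ("x2", True), ("x3", False)], [("x1", True), ("x2", False)]]
POS, NEG = build_sample(VARS, PHI0)
V0 = {"x1": False, "x2": False, "x3": True}

print(separates(valuation_query(VARS, V0), POS, NEG))
for bits in product([False, True], repeat=len(VARS)):
    v = dict(zip(VARS, bits))
    print(bits, separates(valuation_query(VARS, v), POS, NEG), satisfies(v, PHI0))
```

In this encoding a valuation pattern embeds into the brush tree of a literal exactly when the valuation satisfies that literal, and it never embeds into the negative c-tree because each of its brush trees lacks both leaves of some variable.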

The problem of consistency of the input sample in the presence of positive and negative
examples has also been considered for string patterns and found to be NP-complete :2Sj. The
proof can be easily adapted to show the following.

Theorem 8.2 Path$^{\pm}_i$-consistency is NP-complete for any $i \in \{0, 1\}$.

We remark, however, that the proof cannot be extended to twig queries because these are much 
more expressive even when interpreted over linear trees (words).

Overall, the negative results for checking consistency give us

Corollary 8.3 Unless P = NP, none of the classes Path$_i$ and Twig$_i$ for $i \in \{0, 1\}$ is learnable
in polynomial time and data in the presence of positive and negative examples.

9 Related Work 

Our research adheres to computational learning theory [22], a branch of machine learning, and
in particular, to the area of language inference [21]. Our learning framework is inspired by the
one generally used for inference of languages of words and trees [3T] [33] (see also [U] for a survey
of the area). Analogous frameworks have been employed in the context of XML for learning of
DTDs and XML Schemas [9, 8], XML transformations [23], and n-ary automata queries [TTj.

Because positive examples are generally believed to be easier to obtain, learning from
positive examples only is desirable. However, many classes of languages, e.g., regular languages [21],
are learnable only in the presence of both positive and negative examples. Deterministic regular
languages [5] are not learnable from positive examples only; in fact, no superfinite language class,
i.e., a class containing all finite languages and at least one infinite language, can be learned from
positive examples, even if we consider algorithms that do not work in polynomial time. To enable learning
from positive examples only, various restrictions have been considered, e.g., reversible languages [5],
k-testable languages [IS], languages of k-occurrence regular expressions [SJ, and (k, l)-contextual
tree languages [31]. What is important to point out here is that the ability to learn subclasses
of path and twig queries from positive examples comes from the fact that the expressive power
of path and twig queries is relatively weak. Paradoxically, the very same fact is also responsible
for the infeasibility of learning path and twig queries when both positive and negative examples
are present.

Our basic learning algorithm for unary embeddable path queries is inspired by, and can be seen
as an extension of, algorithms for inference of word patterns [2] [37] (see [35] for a survey of the
area). A word pattern is a word using extra wildcard characters. For instance, regular patterns
use a wildcard $\oplus$ matching any nonempty string, e.g., $a \oplus b \oplus c$ matches aabbc and abbbc but not
abbc, abc, and cbc. Extended regular patterns use a wildcard $\circledast$ that matches any (possibly empty)
string, e.g., $a \circledast b$ matches ab and acbcb. To capture unary path queries we need to use the wildcard
$\circledast$ and another wildcard $\odot$ that matches a single letter, and then for instance the pattern $a \circledast b \odot c$
corresponds to the path query /a//b/*/c when interpreted over paths of the input tree. We
observe that $\oplus$ is equivalent to $\circledast\odot$ and engineer our learning algorithm using the ideas behind
the algorithms for inference of regular patterns [2] and extended regular patterns [37].
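
The correspondence between word patterns and path queries can be made concrete with a small matcher. The sketch below is illustrative only: it assumes single-letter node labels, uses the glyphs ⊕, ⊛ and ⊙ as stand-ins for the three wildcards (nonempty string, possibly empty string, single letter), and implements matching by translation to Python regular expressions, which is a convenience of this sketch and not the technique of the cited inference algorithms. The helper names are hypothetical.

```python
import re

# Stand-in glyphs for the three wildcards; the choice of symbols is an assumption.
NONEMPTY = "⊕"   # regular patterns: any nonempty string
ANY = "⊛"        # extended regular patterns: any (possibly empty) string
SINGLE = "⊙"     # exactly one letter

_REGEX = {NONEMPTY: ".+", ANY: ".*", SINGLE: "."}

def pattern_to_regex(pattern):
    """Translate a word pattern into an equivalent regular expression."""
    return re.compile("".join(_REGEX.get(ch, re.escape(ch)) for ch in pattern))

def matches(pattern, word):
    """True if the whole word is an instance of the pattern."""
    return pattern_to_regex(pattern).fullmatch(word) is not None

def path_query_to_pattern(query):
    """Translate a path query over single-letter labels, e.g. /a//b/*/c, to a pattern."""
    parts = query.split("/")[1:]        # drop the empty string before the leading "/"
    out, descendant = [], False
    for part in parts:
        if part == "":                  # an empty part marks a //-edge
            descendant = True
            continue
        if descendant:
            out.append(ANY)
            descendant = False
        out.append(SINGLE if part == "*" else part)
    return "".join(out)

print(path_query_to_pattern("/a//b/*/c"))
print([w for w in ["aabbc", "abbbc", "abbc", "abc", "cbc"] if matches("a⊕b⊕c", w)])
print(all(matches("a⊕b", w) == matches("a⊛⊙b", w) for w in ["ab", "acb", "acbcb", "b"]))
```

The last line checks, on a few words, the observation that the nonempty-string wildcard can be simulated by the possibly-empty wildcard followed by the single-letter wildcard.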

Learning of unary XML queries has been pursued with the use of node selecting tree automata
[11], with extensions that infer n-ary queries [24], take advantage of schema information
[T2], and use pruning techniques to handle incompletely annotated documents [TT]. The
main advantage of using node selecting tree automata is their expressive power. Node selecting
tree automata capture exactly the class of n-ary MSO tree queries [101 [21], which properly includes
twig and path queries. However, tree automata have several drawbacks which may render
them unsuitable for learning in certain scenarios: they are a heavy querying formalism with little
support from the existing infrastructure, and they do not allow an easy visualization of the inferred
query.

Although the class of twig queries is properly included in the class of MSO queries and
path queries are captured by regular languages, using automata-based techniques to infer the
query and then converting it to a twig query is unlikely to be successful: automata translation
is a notoriously difficult task that typically leads to significant blowup [16], and it is generally
considered beneficial to avoid it [TT]. An alternative approach, along the lines of [5], would be
to define a set of structural restrictions on the automaton that would ensure an easy translation 
to twig queries and enforce those conditions during inference. However, such restrictions would 
need to be very strong, at least for twig queries, and this approach would require significant 
modification of the inference algorithm, to the point where it would constitute a new algorithm. 

Methods used for inference of languages represented by automata differ from the methods 
used in our learning algorithms. An automata-based inference typically begins by constructing 
an automaton recognizing exactly the set of positive examples, which is then generalized by a 
series of generalization operations, e.g., fusions of pairs of states. To avoid overgeneralization of
the automata, negative examples are used to retain only consistent generalization operations [32],
and if negative examples are not available, structural properties of the automata class can be
used to pilot the generalization process [3] [T5] [8]. Our algorithms, similarly to word pattern
inference algorithms [2] [37], begin with the universal query and iteratively specialize the query
by incorporating subfragments common to all positive examples. 
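
For contrast with our specialization-based approach, the automata-based scheme described above can be sketched on words as follows. This is a generic, deliberately naive state-fusion loop written for illustration; it is not any of the cited algorithms, and the representation (a transition map from (state, symbol) to a set of states, with state 0 initial) and all function names are assumptions of the sketch.

```python
def prefix_tree_acceptor(words):
    """Automaton recognizing exactly the given positive examples (state 0 is initial)."""
    trans, accepting, fresh = {}, set(), 1
    for w in words:
        state = 0
        for ch in w:
            if (state, ch) not in trans:
                trans[(state, ch)] = {fresh}
                fresh += 1
            (state,) = trans[(state, ch)]   # the prefix tree is deterministic
        accepting.add(state)
    return trans, accepting

def fuse(automaton, keep, drop):
    """Fuse state `drop` into state `keep`; the result may be nondeterministic."""
    trans, accepting = automaton
    rename = lambda s: keep if s == drop else s
    merged = {}
    for (src, ch), targets in trans.items():
        merged.setdefault((rename(src), ch), set()).update(rename(t) for t in targets)
    return merged, {rename(s) for s in accepting}

def accepts(automaton, word):
    """Standard subset simulation of a nondeterministic automaton."""
    trans, accepting = automaton
    current = {0}
    for ch in word:
        current = set().union(*(trans.get((s, ch), set()) for s in current))
    return bool(current & accepting)

def generalize(positives, negatives):
    """Greedily apply fusions, keeping only those that accept no negative example."""
    automaton = prefix_tree_acceptor(positives)
    changed = True
    while changed:
        changed = False
        states = sorted({0} | {t for ts in automaton[0].values() for t in ts})
        for i, keep in enumerate(states):
            for drop in states[i + 1:]:
                candidate = fuse(automaton, keep, drop)
                if not any(accepts(candidate, n) for n in negatives):
                    automaton, changed = candidate, True
                    break
            if changed:
                break
    return automaton

learned = generalize(["ab", "aab"], ["", "a", "b", "ba"])
print(accepts(learned, "aaaab"), accepts(learned, "ba"))
```

Each fusion can only enlarge the recognized language, so the positive examples remain accepted throughout, and the negative examples are the only thing that stops the loop from collapsing everything into a single state.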

XLearner [25] is a practical system that infers XQuery programs. It uses Angluin's DFA
inference algorithm [5] to construct the XPath components of the XQuery program. The system
uses direct user interaction, essentially equivalence and membership queries, to refine the inferred
query. Because of that, the learning framework, called the minimally adequate teacher [5], is
different from ours and allows inferring more powerful queries. We also point out that learning
twigs is not feasible with equivalence queries only [TP].

Raeymaekers et al. propose learning of (k, l)-contextual tree languages to infer queries for
web wrappers [33]. (k, l)-contextual tree languages form a subclass of regular tree languages that
allows specifying conditions on the nodes of the tree at depth up to l, where each condition involves
exactly k subsequent children of a node. Because only nodes at bounded depth can be inspected
and the relative order among children is used, (k, l)-contextual tree languages are incomparable
with twig queries, which can inspect nodes at arbitrary depths but ignore the relative order of
nodes.

Finally, we point out that the problem of query inference has also been studied in the
relational setting [351 1311 EH]. Relational databases and their query languages offer a set
of opportunities and challenges radically different from those encountered in semi-structured 
databases. For instance, the query inference involves constructing a desired selection condition 
that yields the required tuples from a table, a task that easily becomes intractable. 

10 Conclusions and future work



We have studied the problem of inferring an XML query from examples given by the user. We 
have investigated several classes of Boolean and unary, path and twig queries and considered two 
settings for the problem: one allowing positive examples only and one that allows both positive 
and negative examples. For the setting with positive examples only, we have presented sound 
and complete learning algorithms for practical subclasses of queries: anchored path queries and 
path-subsumption-free twig queries. On the other hand, including negative examples in the
input sample renders learning infeasible.

We believe that negative examples have an important informative quality and we intend
to investigate approaches that take advantage of it. Two directions are possible: relaxing the
definition of learnability and extending the query class. For the first direction, a notion allowing
the query to select some negative examples and omit some positive examples is a natural way of making our
learning algorithms capable of producing queries of better quality (cf. Example [5]) and able to
handle noisy samples. For the second direction, our preliminary results show that adding union
to the query languages renders consistency quite simple to decide, but the satisfaction of $P_1$
and $P_2$ is not clear, and therefore new learning techniques need to be developed. We are also
interested in extending the query language with other operators (e.g., negation) and seeing their
impact on learnability.

We observe that the main reason for restricting our attention to anchored path queries is the
properties $P_1$ and $P_2$ defined in Section [3], which allow us to use embeddings to equate the semantics
of the query with its structure and enforce the existence of match sets of polynomial size. [27, 26]
introduced adorned path queries, allowing //*/* to be represented as //- 2 , and extended embeddings to
homomorphisms of adorned queries. Homomorphisms are shown to connect tightly the structure
of path queries and their semantics ($P_1$). It would be interesting to see to what extent the
notion of homomorphism could be used to improve learnability results. We point out that for
path queries the only known construction of match sets produces exponential sets. Moreover, the
homomorphism technique does not work for twig queries.

Finally, we would like to enable our algorithms to take advantage of schema information (cf. 
Example in]). The schema may be given explicitly e.g., as a DTD, or implicitly as a result of a 
learning algorithm. Because testing the containment of XPath queries in the presence of DTDs is 
known to be intractable in general [151 1301 [7] and in fact most of the reductions showing hardness
use (or can be modified to use) anchored queries, the use of DTDs in this context may be quite 
limited and we intend to investigate alternative schema formalisms tailored for query learning. 